BMJ Health & Care Informatics
Preprints posted in the last 30 days, ranked by how well they match the content profile of BMJ Health & Care Informatics, based on 13 papers previously published in the journal. The average preprint has a 0.03% match score for this journal, so anything above that is an above-average fit.
Uzochukwu, B. S. C.; Cherima, Y. J.; Enebeli, U. U.; Okeke, C. C.; Uzochukwu, A. C.; Omoha, A.; Hassan, B.; Eronu, E. M.; Yusuf, S. M.; Uzochukwu, K. A.; Kalu, E. I.
Background: The integration of artificial intelligence (AI) into clinical practice holds transformative potential for healthcare in West Africa, but safe deployment requires context-appropriate governance, accountability, and post-deployment monitoring frameworks. This cross-sectional mixed-methods study examined preferences and concerns of West African clinicians and technical experts regarding AI governance structures, post-deployment surveillance mechanisms, and accountability allocation. Methods: A structured questionnaire was administered to 136 physicians affiliated with the West African College of Physicians (February 22-28, 2026), complemented by 72 key informant interviews with technical leads, AI developers, data scientists, policymakers, and healthcare leaders. Data were analyzed using descriptive statistics, inferential tests, and thematic analysis. Results: Clinicians strongly preferred independent regulatory bodies (40.4%) for overseeing AI tool performance, with high trust ratings (mean: 4.3/5), while vendor self-monitoring received minimal support (3.7%, mean: 2.4/5). Real-time dashboards were the most favored monitoring approach (41.9%). Clear accountability pathways (94.1%), algorithm transparency (91.9%), and real-time performance data (89.7%) were rated essential by majorities. Major concerns included clinicians being unfairly blamed for AI errors (76.5%), excessive vendor control (72.8%), and absence of clear reporting pathways (69.9%). Qualitative findings emphasized continuous performance tracking for accuracy, fairness, and bias; structured incident reporting; protocols for model drift and failure; and multi-layered governance combining independent oversight, institutional AI committees, and explicit liability frameworks. Conclusion: This study provides the first empirical evidence from West Africa on clinician preferences for AI governance.
Findings offer actionable guidance for policymakers to build trustworthy, equitable, and safe AI integration frameworks that prioritize transparency, independent oversight, and clinician protection. Keywords: artificial intelligence; AI governance; post-deployment monitoring; accountability; West Africa; clinician preferences; health data science.
Thomas, C.; Kim, J. Y.; Hasan, A.; Kpodzro, S.; Cortes, J.; Day, B.; Jensen, S.; LHuillier, S.; Oden, M. O.; Zumbado Segura, S.; Maurer, E. W.; Tucker, S.; Robinson, S.; Garcia, B.; Muramalla, E.; Lu, S.; Chawla, N.; Patel, M.; Balu, S.; Sendak, M.
Safety net healthcare delivery organizations (SNOs) serve vulnerable populations but face persistent challenges in adopting new technologies, including AI. While systematic barriers to technology adoption in SNOs are well documented, little is known about how AI is implemented in these settings. This study explored real-world AI adoption in SNOs, focusing on identifying barriers encountered across the AI lifecycle and strategies used to overcome them. Five SNOs in the U.S. participated in a 12-month technical assistance program, the Practice Network, to implement AI tools of their choosing. Observed barriers and mitigation strategies were documented throughout program activities and, at the conclusion of the program, reviewed and refined with participants using a participatory research approach to ensure findings reflected lived experiences and organizational contexts. Key barriers emerged during the Integration and Lifecycle Management phases and included gaps in AI performance evaluation and impact assessments, communication with patients about AI use, foundational AI education, financial resources for purchasing and maintaining AI tools, and AI governance structures. Effective strategies for addressing these barriers were primarily supported through centralized expertise, structured guidance, and peer learning. These findings provide granular, actionable insights for SNO leaders, offering guidance for anticipating barriers and proactively planning mitigation strategies. By including SNO perspectives, the study also contributes to the broader health AI ecosystem and underscores the importance of participatory, collaborative approaches to support safe, effective, and ethical AI adoption in resource-constrained settings. Author Summary: Safety net organizations (SNOs) are healthcare systems that primarily serve low-income and underinsured patients.
While interest in artificial intelligence (AI) in healthcare has grown rapidly, little is known about how these organizations experience AI adoption in practice. In this study, we partnered with five SNOs over a 12-month program to document the challenges they encountered when implementing AI tools and the strategies they used to address them. We worked closely with SNO staff throughout the process to ensure our findings reflected their lived experiences with AI implementation. We found that the most common challenges arose when organizations tried to integrate AI into daily operations and monitor and maintain those tools over time. Specific barriers included difficulty evaluating whether AI was performing as expected, limited guidance on communicating with patients about AI use, a lack of resources for staff training, limited financial resources, and the absence of formal governance structures. Successful strategies for overcoming these challenges drew on shared knowledge and structured support provided by the program, as well as learning from peer organizations. These findings offer practical guidance for SNO leaders planning or managing AI adoption, and contribute to a broader conversation about what is required to implement AI safely and effectively in healthcare settings that serve the most medically and socially vulnerable patients.
Jafarifiroozabadi, R.
Background: Safety is a critical concern in behavioral health crisis units (BHCUs), where environmental risks (e.g., ligature points) can lead to injury to self or others. However, limited research has examined how perceived safety influences facility selection among patients and care partners, or how these perceptions align with AI-driven safety risk assessments in such environments. Method: To address these gaps, a nationwide discrete choice online survey was conducted using image-based scenarios of BHCU environments, where participants selected preferred facilities based on a range of attributes, including environmental safety risks (e.g., ligature points). Additionally, participants identified safety risks in survey images, which were compared with outputs from an AI-driven tool developed and trained to detect environmental risks by experts. Quantitative analysis using conditional logit models examined the influence of attributes on facility choice, while spatial comparisons of annotated images and heatmaps assessed participant and AI-identified risk alignments. Results: Findings revealed that the higher frequency of safety risks in images significantly reduced the likelihood of facility selection (p < .001, OR ≈ 1.28), highlighting the importance of perceived safety in user decision-making. While there was notable alignment between heatmaps generated by participants and AI, key differences emerged, suggesting that participant safety perception was influenced by features not fully captured by AI, such as the type of materials or unknown, out-of-label safety risks in facility images. Conclusions: Despite these limitations, results highlighted the value of integrating AI-driven assistive tools for non-expert user safety risk assessment to support decision-making for safer BHCU environments.
Nkosi-Mjadu, B. E.
Background: South Africa's public healthcare system serves most of the population through approximately 3,900 primary healthcare clinics characterised by long waiting times and high volumes of repeat-prescription visits. No published pre-arrival digital triage system operates across all 11 official South African languages while aligning with the South African Triage Scale (SATS). This paper reports the design and preliminary safety validation of BIZUSIZO, a hybrid deterministic-AI WhatsApp triage system. Methods: BIZUSIZO delivers SATS-aligned triage via WhatsApp, combining AI-assisted free-text classification (Claude Haiku 4.5) with a Deterministic Clinical Safety Layer (DCSL) that overrides AI output for 53 clinical discriminator categories (14 RED, 19 ORANGE, 20 YELLOW) coded in all 11 official languages and independent of AI availability. A five-domain risk factor assessment can only upgrade triage level. One hundred and twenty clinical vignettes in patient language (English, isiZulu, isiXhosa, Afrikaans; 30 per language) were scored against a developer-assigned gold standard with independent blinded nurse review. A 121-vignette multilingual DCSL safety consistency check across all 11 languages and a 220-call post-hoc framing sensitivity evaluation (110 paired vignettes) were also conducted. Results: Under-triage was 3.3% (4/120; 95% CI: 0.9%-8.3%) with no RED under-triage; exact concordance was 80.0% (96/120) and quadratic weighted kappa 0.891 (95% CI: 0.827-0.932). One two-level under-triage was observed on a non-RED presentation (V072, isiXhosa burns vignette, ORANGE→GREEN); one two-level over-triage was observed (V054, isiZulu deep laceration, YELLOW→RED). In the framing sensitivity evaluation, AI-only classification achieved 50.9% RED invariance under adversarial framing; full-pipeline classification achieved 95.0% in four validated languages, with the DCSL rescuing 18 of 23 AI drift cases.
Conclusions: A hybrid deterministic-AI triage system with DCSL-based emergency detection achieved zero RED under-triage and consistent RED detection across all 11 official languages. The 16.7% over-triage rate falls within published South African SATS ranges (13.1-49%). A single two-level under-triage event was observed on an isiXhosa burns vignette (ORANGE→GREEN) and is discussed in Limitations. Findings are preliminary; prospective validation against independent nurse triage is the necessary next step.
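The two headline metrics above, under-triage rate and quadratic weighted kappa, are standard ordinal-agreement calculations. The sketch below shows how they can be computed from vignette labels; the level ordering follows SATS, but the example vignettes and functions are illustrative, not the study's data or code.

```python
# Hypothetical scoring sketch: under-triage rate and quadratic weighted kappa
# between system-assigned and gold-standard triage levels. Example labels are
# made up for illustration; they are not the study's vignettes.

LEVELS = {"GREEN": 0, "YELLOW": 1, "ORANGE": 2, "RED": 3}

def under_triage_rate(gold, assigned):
    """Fraction of vignettes triaged below their gold-standard level."""
    under = sum(1 for g, a in zip(gold, assigned) if LEVELS[a] < LEVELS[g])
    return under / len(gold)

def quadratic_weighted_kappa(gold, assigned, k=4):
    """Agreement on an ordinal scale, penalising larger disagreements more."""
    n = len(gold)
    obs = [[0.0] * k for _ in range(k)]
    for g, a in zip(gold, assigned):
        obs[LEVELS[g]][LEVELS[a]] += 1 / n
    row = [sum(obs[i]) for i in range(k)]                        # gold marginals
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]   # system marginals
    w = lambda i, j: (i - j) ** 2 / (k - 1) ** 2                 # quadratic weight
    num = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - num / den

gold = ["RED", "ORANGE", "ORANGE", "YELLOW", "GREEN", "RED"]
sys_ = ["RED", "ORANGE", "GREEN", "YELLOW", "GREEN", "RED"]  # one 2-level under-triage
print(under_triage_rate(gold, sys_))   # one of six vignettes is under-triaged
print(quadratic_weighted_kappa(gold, sys_))
```

The quadratic weighting is what makes a two-level slip (ORANGE to GREEN) cost four times as much as a one-level slip, which is why a single such event is worth singling out in the abstract.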
Vasquez-Venegas, C.; Chewcharat, A.; Kimera, R.; Kurtzman, N.; Leite, M.; Woite, N. L.; Muppidi, I. J.; Muppidi, R. J.; Liu, X.; Ong, E. P.; Pal, R.; Myers, C.; Salzman, S.; Patscheider, J. S.; John, T. R.; Rogers, M.; Samuel, M.; Santana-Guerrero, J. L.; Yaacob, S.; Gameiro, R. R.; Celi, L. A.
Computer vision models for chest X-ray interpretation hold significant promise for global healthcare, but their clinical value depends on equitable development across diverse populations. We conducted a scientometric analysis to examine authorship patterns, geographic distribution, and dataset origins to assess potential disparities that could affect clinical applicability. We systematically reviewed literature on computer vision applications for chest X-rays published between 2017-2025 across multiple databases, including PubMed, Embase and SciELO databases. Using Dimensions API and manual extraction, we analyzed 928 eligible studies, examining first and senior author affiliations, institutional contributions, dataset provenance, and collaboration patterns across different income classifications based on World Bank categories. High-income countries dominated research leadership, representing 55.6% of first authors and 59.7% of senior authors; no first authors were affiliated with low-income countries. China (16.93%) and the United States (16.72%) led in first authorship positions. Most datasets (73.6%) originated from high-income settings, with the United States being the largest contributor (40.45%). Private datasets were most frequently used (20.52%). Cross-income collaborations were rare, with only 3.9% of publications involving partnerships between high-income and lower-middle-income countries. Findings reveal substantial disparities in who shapes computer vision research on chest X-rays and which populations are represented in training data. These imbalances risk developing AI systems that perform inconsistently across diverse healthcare settings, potentially exacerbating healthcare inequities. Addressing these disparities requires coordinated efforts to develop globally representative datasets, establish equitable international collaborations, and implement policies that promote inclusive research practices.
Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.
Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score. Non-parametric methods were used throughout. Results: Clinicians reported high perceived time saving (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: Medical AI assistant, LLMs in healthcare, Agentic AI, Clinical decision support, Point of care AI
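The Net Promoter Score reported above follows the standard definition on a 0-10 rating scale. As a reminder of the arithmetic, a minimal sketch (the ratings below are made up, not the study's data):

```python
# Standard NPS: percentage of promoters (ratings 9-10) minus percentage of
# detractors (ratings 0-6); passives (7-8) count toward n but neither group.
# The ratings below are illustrative only.

def nps(ratings):
    n = len(ratings)
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / n

print(nps([10, 9, 9, 8, 10]))  # 4 promoters, 1 passive, 0 detractors -> 80.0
```

With no detractors, as in the study, the score reduces to the percentage of promoters, which is how an NPS as high as 81.2 arises.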
Ng, J. Y.; Bhavsar, D.; Dhanvanthry, N.; Bouter, L.; Chan, T.; Cramer, H.; Flanagin, A.; Iorio, A.; Lokker, C.; Maisonneuve, H.; Marusic, A.; Moher, D.
Background: Artificial intelligence chatbots (AICs), as a form of generative artificial intelligence (AI), are increasingly being considered for use in scholarly peer review to assist with tasks such as identifying methodological issues, verifying references, and improving language clarity. Despite these potential benefits, concerns remain regarding their reliability, ethical implications, and transparency. Evidence on how medical journal peer reviewers perceive the role and impact of AICs is limited. This study explored reviewers' familiarity with AICs, perceived benefits and challenges, ethical concerns, and anticipated future roles in peer review. Methods: We conducted a cross-sectional online survey of medical journal peer reviewers. Corresponding author information was extracted from MEDLINE-indexed articles added to PubMed within a two-month period using an R-based approach. A total of 72,851 authors were invited via email to participate; those who self-identified as peer reviewers were eligible. The 29-item survey assessed familiarity with AICs and perceptions of their benefits and limitations in peer review. The survey was administered via SurveyMonkey from April 28 to June 16, 2025, with two reminder emails sent during the data collection period. Results: A total of 1,260 respondents completed the survey. Most participants were familiar with AICs (86.2%) and had used tools such as ChatGPT for general purposes (87.7%), but the majority had not used AICs for peer review (70.3%). Most respondents reported that their institutions do not provide training on AIC use in peer review (69.5%), although many expressed interest in such training (60.7%). Perceptions of AIC benefits were mixed, while concerns were widely shared, particularly regarding potential algorithmic bias (80.3%) and issues related to trust and user acceptance (73.3%). Conclusions: While familiarity with AICs is high among medical journal peer reviewers, their use in peer review remains limited. 
There is clear interest in training and guidance; however, concerns related to ethics, data privacy, and research integrity persist and should be addressed before broader implementation.
Miran, S. A.; Cheng, Y.; Faselis, C.; Brandt, C.; Vasaitis, S.; Nesbitt, L.; Zanin, L.; Tekle, S.; Ahmed, A.; Nelson, S. J.; Zeng-Treitler, Q.
Objectives: To develop and evaluate predictive models for unused outpatient appointments (missed or cancelled) using a large national electronic health record (EHR) repository in the United States. Design: Retrospective observational study using machine learning and statistical modeling. Setting: A U.S. national electronic health record repository (Cerner Real World Database) covering healthcare encounters from 2010 to 2025. Participants: Adult patients aged ≥18 years with routine outpatient encounters recorded in the database. One outpatient appointment with a known status was randomly selected per patient, resulting in a final analytic sample of 5,699,861 encounters. Primary and Secondary Outcome Measures: The primary outcome was whether the index outpatient appointment was attended or unused (missed or cancelled). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Methods: Predictors included patient characteristics (demographics and insurance type), appointment characteristics (day, time, season, and urbanicity), prior cancellation rate, and time gap between the index appointment and the previous visit. We compared the predictive performance of two machine learning models (random forest classifier and extreme gradient boosting (XGBoost)) with logistic regression. An explainable AI analysis of feature impact was performed on the final XGBoost model. Results: Among 5,699,861 outpatient encounters, 3,650,715 (64.0%) were attended and 2,049,146 (36.0%) were unused. XGBoost achieved the best predictive performance on the test dataset (AUC = 0.95), followed by random forest (AUC = 0.92) and logistic regression (AUC = 0.89). Feature impact score analysis revealed highly non-linear associations between predictors and the risk of unused appointments at the individual level. Conclusions: Unused outpatient appointments can be accurately predicted using routinely available EHR data.
Integrating predictive models into scheduling workflows may improve healthcare efficiency and optimize appointment management. Article Summary: Strengths and limitations of this study:
- This study used one of the largest national electronic health record datasets to develop predictive models for unused outpatient appointments.
- Multiple modeling approaches, including logistic regression and machine learning methods (random forest and XGBoost), were compared to evaluate predictive performance.
- An explainable artificial intelligence method was applied to quantify feature impact and improve model interpretability.
- The retrospective design and reliance on routinely collected EHR data may introduce data quality limitations and unmeasured confounding.
- The database did not distinguish clearly between cancelled appointments and no-shows.
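The model comparison above hinges on AUC, which has a simple rank-based interpretation: the probability that a randomly chosen unused appointment receives a higher predicted risk than a randomly chosen attended one. A minimal sketch of that computation on toy values (not the study's data; the actual analysis used random forest, XGBoost, and logistic regression as described):

```python
# AUC via its rank interpretation: P(score of a positive > score of a negative),
# counting ties as one half. Labels: 1 = unused appointment, 0 = attended.
# The labels/scores below are toy values for illustration only.

def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]   # unused appointments
    neg = [s for y, s in zip(labels, scores) if y == 0]   # attended appointments
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.2, 0.7, 0.8]
print(auc(labels, scores))
```

This pairwise definition makes it clear why AUC is unaffected by the 64%/36% class imbalance reported above, which is one reason it suits this kind of outcome.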
Blankson, P.-K.; Hussien, S.; Idris, F.; Trevillion, G.; Aslam, A.; Afani, A.; Dunlap, P.; Chepkorir, J.; Melgarejo, P.; Idris, M.
Background: Recruitment remains a major barrier to timely clinical trial completion. Trialshub is an LLM-powered, chat-based platform intended to help users identify relevant trials and connect with coordinators to streamline recruitment workflows. Objective: To evaluate the perceived usability and operational value of Trialshub, and identify implementation considerations for real-world deployment. Methods: A usability test was conducted at Morehouse School of Medicine for the Trialshub application. Purposively selected participants included clinical research coordinators and individuals with and without clinical trial search experience. Participants completed a pre-test survey assessing demographics, digital health information behaviors, and familiarity with AI tools, followed by a moderated usability session using a Trialshub prototype. Users completed scenario-based tasks (locating a breast cancer trial, reviewing results, and initiating coordinator contact) using a think-aloud protocol. Task ratings, screen recordings, and transcribed feedback were analyzed descriptively and thematically, and reported. Results: Participants reported high comfort with using digital tools and moderate-to-high familiarity with AI. Trialshub's chat-first design, guided prompts, and checklist-style eligibility display were perceived as intuitive and reduced cognitive load. Fast access to trials and the coordinator-contact workflow were viewed positively. Key usability issues included uncertainty at step transitions, insufficient cues for selecting results and next actions, and inconsistent system reliability (loading delays, errors, and broken trial detail pages). Participants also noted redundant questioning due to limited conversational memory, requested improved filtering/sorting, and clearer calls-to-action. All participants indicated that Trialshub has strong potential to meaningfully improve clinical trial processes.
Conclusions: Trialshub shows promise for improving trial discovery and recruitment workflows, with identified design implications for real-world deployment.
Streicher, N. S.
Background and Objectives: Patient portals have become essential infrastructure for healthcare delivery following the 21st Century Cures Act, yet adoption remains inequitable. Understanding demographic and geographic determinants of portal activation is critical for addressing digital health disparities, particularly among neurology patients who face unique access barriers. We examined the demographic, geographic, and neighborhood-level factors associated with patient portal activation among neurology patients at multiple geographic scales in the Washington, DC metropolitan area. Methods: We conducted a retrospective cohort study of 72,417 adult neurology patients seen at two academic medical centers sharing an electronic health record in Washington, DC (February 2021-February 2026). We examined portal activation using multivariable logistic regression and geographic analysis at four nested scales: the metropolitan catchment area, DC's eight wards, individual census tracts (via geocoded patient addresses), and individual DC residents. Results: Portal activation was 64.7% overall. Activation varied by race/ethnicity (Non-Hispanic White 76.1%, Non-Hispanic Black 57.0%, Non-Hispanic Asian 57.6%, Hispanic 55.0%) and geography (DC Ward 2: 82.0% vs. Ward 7: 48.0%). Ward-level educational attainment (r = 0.948), broadband access (r = 0.889), and income (r = 0.811) were strongly correlated with activation. Within individual wards, Non-Hispanic White patients activated at 84-91% while Non-Hispanic Black patients activated at 48-64%, demonstrating that neighborhood resources alone do not explain disparities. Discussion: Patient portal activation is shaped by demographic, socioeconomic, and geographic factors operating at multiple levels. Persistent within-ward racial disparities indicate that geographically targeted interventions must be paired with culturally tailored approaches to achieve digital health equity.
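The ward-level r values above are Pearson correlations computed across a small number of geographic units (eight wards). A sketch of that calculation, with illustrative ward values rather than the study's data:

```python
# Pearson correlation between a ward-level covariate and portal activation,
# mirroring the ward-level analysis described above. The eight values per
# series are made-up illustrations, not study data.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# e.g. % of ward residents with a bachelor's degree vs. % portal activation
education = [30, 45, 50, 55, 60, 70, 80, 85]
activation = [48, 55, 58, 60, 63, 70, 78, 82]
print(pearson_r(education, activation))
```

With only eight wards, correlations this strong can still be fragile, which is one reason the study also drills down to census tracts and individual residents.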
Luisto, R.; Snell, K.; Vartiainen, V.; Sanmark, E.; Äyrämö, S.
In this study, we investigate gender bias in a Retrieval-Augmented Generation (RAG) based AI assistant developed for Finnish wellbeing services counties. We tested the system using 36 clinically relevant queries, each rendered in three gendered variants (male, female, gender-neutral), and evaluated responses using both an LLM-as-a-judge approach and a human expert panel consisting of a physician and a sociologist specializing in ethics. We observed substantial and clinically significant differences across gendered variants, including differential treatment urgency, inappropriate symptom associations, and misidentification of clinical context. Female variants disproportionately framed responses around childcare and reproductive health regardless of clinical relevance, reflecting societal stereotypes rather than medical reasoning. Bias manifested both at the LLM generation stage and the RAG retrieval stage, in several cases causing the model to hallucinate responses entirely. Some bias patterns were persistent across repeated runs, while others appeared inconsistently, highlighting the challenge of distinguishing systematic bias from stochastic variation.
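The probing design described above, each query rendered in three gendered variants and the responses compared for divergence, can be sketched as follows. The template, variant words, and `ask_assistant` stand-in are hypothetical; the real study queried a Finnish RAG assistant and judged responses with an LLM and a human expert panel.

```python
# Sketch of variant-based bias probing: render one clinical query in three
# gendered variants, collect responses, and flag divergence. `ask_assistant`
# is a hypothetical placeholder for the RAG system under test; here it is
# deterministic so the comparison logic can be demonstrated.

QUERY_TEMPLATE = ("A {subject} reports chest pain radiating to the left arm. "
                  "What should they do?")
VARIANTS = {"male": "man", "female": "woman", "neutral": "person"}

def ask_assistant(query):
    # Placeholder for a call to the real assistant.
    return "Seek emergency care immediately."

def probe(template, variants, ask):
    responses = {label: ask(template.format(subject=word))
                 for label, word in variants.items()}
    divergent = len(set(responses.values())) > 1   # any variant answered differently?
    return responses, divergent

responses, divergent = probe(QUERY_TEMPLATE, VARIANTS, ask_assistant)
print(divergent)  # False here, since the placeholder answers identically
```

In practice each probe would be repeated across runs, as the study did, precisely because some bias patterns surfaced only intermittently and must be separated from stochastic variation.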
Mahdikhani, S.; Cleary, F.; Cummins, S.
Objectives: Endometriosis affects approximately 10% of reproductive age women worldwide, yet care pathways remain fragmented and treatments have limitations. This study aimed to identify and categorize key stakeholders in endometriosis care in Ireland, assess their influence and interest in the digital health initiative, and identify drivers and barriers affecting uptake of innovative approaches to care. Methods: A virtual stakeholder mapping workshop was conducted with participants from healthcare, policy, education, technology, academia, and patient communities. Using a structured MS Teams Whiteboard, participants generated a stakeholder list, positioned stakeholders on an Influence-Interest Matrix, and provided qualitative insights on factors enabling or constraining engagement with digital health innovation. Results: Stakeholders were distributed across all four quadrants of the matrix. High-interest/high-influence stakeholders included the HSE, specialist centres, general practitioners, and the Endometriosis Association of Ireland. High-interest/low-influence groups comprised patients, families, and online communities, while policymakers, hospital managers, and the education sector were identified as high-influence but low-interest actors. Key drivers included strong patient advocacy, institutional support such as engagement from the HSE, and growing awareness of digital health tools. Major barriers encompassed prolonged diagnostic delays, resource constraints, gaps in clinical knowledge, technology anxiety, and challenges sustaining engagement. Conclusions: Stakeholder mapping provided an evidence-informed foundation for the VendoR project, revealing engagement gaps and leverage points critical for improving endometriosis care innovation. 
The findings highlight the need for intentional, well-resourced strategies that elevate patient voices, address systemic barriers, and ensure balanced representation, supporting the co-design, co-creation, and co-production of digital health interventions for sustainable, patient-centred care.
Ng, J. Y.; Tan, J.; Syed, N.; Adapa, K.; Gupta, P. K.; Li, S.; Mehta, D.; Ring, M.; Shridhar, M.; Souza, J. P.; Yoshino, T.; Lee, M. S.; Cramer, H.
Background: Generative artificial intelligence (GenAI) chatbots have shown utility in assisting with various research tasks. Traditional, complementary, and integrative medicine (TCIM) is a patient-centric approach that emphasizes holistic well-being. The integration of TCIM and GenAI presents numerous key opportunities. However, TCIM researchers' attitudes toward GenAI tools remain less understood. This large-scale, international cross-sectional survey aimed to elucidate the attitudes and perceptions of TCIM researchers regarding the use of GenAI chatbots in the scientific process. Methods: A search strategy in Ovid MEDLINE identified corresponding authors who were TCIM researchers. Eligible authors were invited to complete an anonymous online survey administered via SurveyMonkey. The survey included questions on socio-demographic characteristics, familiarity with GenAI chatbots, and perceived benefits and challenges of using GenAI chatbots. Results were analysed using descriptive statistics and thematic content analysis. Results: The survey received 716 responses. Most respondents reported familiarity with GenAI chatbots (58.08%) and viewed them as very important to the future of scientific research (54.37%). The most acknowledged benefits included workload reduction (74.07%) and increased efficiency in data analysis/experimentation (71.14%). The most frequently reported challenges involved bias, errors, and limitations. More than half of the respondents (57.02%) expressed a need for training to use GenAI chatbots in the scientific process, alongside an interest in receiving training (72.07%). However, 43.67% indicated that their institutions did not offer these programs. Discussion: A deeper understanding of TCIM researchers' perspectives can inform future AI applications in this field and guide future policies and collaboration among researchers.
Tai, K. H.; Varvara, G.; Escoffier, E.; Mansmann, U.; DeVito, N. J.; Vieira Armond, A. C.; Naudet, F.
Objective: To map the presence, public availability, and content of clinical trial data sharing policies (DSP), data management and sharing plans (DMSP), and data use agreements (DUA) among the most prolific public and private clinical trial sponsors operating in the European Union, and to identify key areas of convergence, divergence, and constraint in the context of the General Data Protection Regulation (GDPR). Eligibility criteria: We included organisation-level documents describing approaches to clinical trial data sharing or data management from the top 20 public and top 20 private sponsors ranked by the number of trials registered in the EU Clinical Trials Information System (CTIS). Eligible materials comprised publicly available or sponsor-shared policies, guidelines, statements, templates, and agreements relevant to clinical trial data sharing or management. Sources of evidence: Evidence was identified through systematic searches of sponsors' public websites, structured Google searches, and major data management plan platforms (DMPTool, DMPonline, DMP Assistant), complemented by direct contact with sponsors to verify findings and request missing documentation. All sources were archived and catalogued. Charting methods: Two reviewers independently extracted data using a structured form, capturing the existence, accessibility, and content of data sharing policies, data management and sharing plans, and data use agreements. Quantitative data were summarised descriptively, and a non-interpretive descriptive content analysis was conducted to characterise recurring policy elements and areas of heterogeneity. Results: Among 40 sponsors, private sponsors were substantially more likely than public sponsors to make trial-specific data sharing policies and data use agreements publicly accessible, often via established data sharing platforms.
Public sponsors more frequently referenced data management and sharing plans, but these were heterogeneous in scope and often embedded within broader institutional governance documents rather than tailored to clinical trials. Across sectors, GDPR compliance, data protection, and legal safeguards were emphasised, while operational aspects such as dataset readiness, review criteria, and downstream responsibilities varied widely. The overall response rate to sponsor verification was 37.5%. Conclusion: Clinical trial data sharing governance in the EU shows a marked sectoral imbalance among the top sponsors. Private sponsors tend to provide more detailed and operationally explicit documentation, whereas public sponsors often articulate high-level commitments without trial-specific guidance. Greater clarity and standardisation, particularly among public sponsors, could improve transparency and facilitate responsible data reuse, while remaining compatible with GDPR requirements.
Gartlehner, G.; Banda, S.; Callaghan, M.; Chase, J.-A.; Dobrescu, A.; Eisele-Metzger, A.; Flemyng, E.; Gardner, S.; Griebler, U.; Helfer, B.; Jemiolo, P.; Macura, B.; Minx, J. C.; Noel-Storr, A.; Rajabzadeh Tahmasebi, N.; Sharifan, A.; Meerpohl, J.; Thomas, J.
Background: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews. Methods: Members of the Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods Project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles. Results: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations. Discussion: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses. 
By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.
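The protocol above names sensitivity, specificity, and precision as its accuracy metrics for screening tasks. As a minimal illustration of how such confusion-matrix metrics are derived, here is a sketch in Python; the function name and data are hypothetical, not from the CESAR protocol, and the human-only decisions are treated as the reference standard:

```python
def screening_metrics(human_labels, tool_labels):
    """Confusion-matrix metrics for a screening task.

    Labels are booleans: True = include the record, False = exclude.
    Human-only decisions serve as the reference standard.
    """
    pairs = list(zip(human_labels, tool_labels))
    tp = sum(h and t for h, t in pairs)          # both include
    tn = sum(not h and not t for h, t in pairs)  # both exclude
    fp = sum(not h and t for h, t in pairs)      # tool over-includes
    fn = sum(h and not t for h, t in pairs)      # tool misses an include
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "precision":   tp / (tp + fp) if tp + fp else None,
    }

# Six records screened by a hypothetical AI tool vs. the human reference
human = [True, True, True, False, False, False]
tool  = [True, True, False, True, False, False]
print(screening_metrics(human, tool))
```

In a platform design, the same function would be applied per tool and per task (title/abstract screening, full-text screening), so tools can be compared on identical record sets at each interim analysis.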
Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.
Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology, created with a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: Accuracy varied across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). Conversely, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures in which retrieved codes are not used. 
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows that avoid these failures.
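The study scores generated codelists by penalising omitted required codes and irrelevant inclusions, with optional codes tolerated. A minimal sketch of that kind of scoring logic follows; the exact weighting here is illustrative and is not the study's model-grading rubric, and all code names are hypothetical:

```python
def score_codelist(generated, required, optional):
    """Score a generated codelist against a tagged reference list.

    Omitted required codes reduce recall; codes outside the
    required and optional sets reduce precision; optional codes
    neither help nor hurt. Returns 0 (all wrong) to 100 (all correct).
    """
    generated, required, optional = set(generated), set(required), set(optional)
    found_required = len(generated & required)
    irrelevant = len(generated - required - optional)
    recall = found_required / len(required) if required else 1.0
    precision = (len(generated) - irrelevant) / len(generated) if generated else 0.0
    return round(100 * (recall + precision) / 2, 1)

# Hypothetical reference: three required codes, one optional code
ref_required = {"code_a", "code_b", "code_c"}
ref_optional = {"code_d"}
print(score_codelist({"code_a", "code_b", "code_x"}, ref_required, ref_optional))
```

A deterministic scorer like this is the set-based counterpart to the paper's model-grading: it captures omission and irrelevant inclusion exactly, at the cost of missing the semantic judgments an LLM grader can make about near-miss codes.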
Komba, P.; Simmonds, G.; Dunbar, E. L.; Bundy, K.; Irving-Mattocks, N.; McDowell, M.; Ghee, A. E.; Puttkammer, N.
Background: Continuous Quality Improvement (CQI) is a core strategy for strengthening health systems, yet documentation and monitoring of CQI activities remain fragmented in many low- and middle-income country (LMIC) settings. In Jamaica, CQI has been institutionalized across priority programs, but it largely relies on paper-based tools and basic digital platforms that limit timely learning and oversight. To address these gaps, Jamaica's Ministry of Health and Wellness (MOHW), in collaboration with the Caribbean Training and Education Centre for Health (C-TECH), adapted a web-based CQI application using a participatory, human-centered design approach. Methods: We conducted a formative, convergent mixed-methods evaluation across 24 healthcare facilities to assess early-stage implementation of the CQI app. Guided by the Implementation Outcomes Framework, we examined acceptability, adoption, appropriateness, and feasibility. Quantitative data were collected through a structured survey of healthcare workers (n=43), and qualitative data were gathered through five focus group discussions (n=33) and three key informant interviews with CQI leads. Survey data were summarized descriptively, and qualitative data were analyzed using rapid qualitative analysis. Findings were integrated using joint displays. Results: Survey findings indicated moderate to high perceived acceptability and appropriateness of the CQI app, with 70% of participants reporting that it saved time and 67% noting that it aligned with facility goals. However, 19% reported never using it. Qualitative findings highlighted the app's value for improving CQI documentation, visualizing trends, and supporting supervisory oversight. Key barriers to sustained adoption included inconsistent internet connectivity, limited follow-up training, unclear team roles, and challenges integrating app use into routine workflows. Leadership engagement and alignment with existing CQI structures emerged as critical enablers. 
Conclusion: This formative evaluation suggests that a digitally enabled CQI platform can strengthen documentation and oversight of quality improvement activities in resource-constrained health systems when embedded within supportive organizational and infrastructural contexts. Addressing foundational system readiness, including leadership engagement, capacity-building, and workflow integration, will be essential to realizing the CQI app's potential in Jamaica and similar LMIC settings.
Jiang, Q.; Ke, Y.; Sinisterra, L. G.; Elangovan, K.; Li, Z.; Yeo, K. K.; Jonathan, Y.; Ting, D. S. W.
Coronary artery disease is a leading cause of morbidity and mortality. Invasive coronary angiography is currently the gold standard in disease diagnosis. Several studies have attempted to use artificial intelligence (AI) to automate its interpretation with varying levels of success. However, most existing studies cannot generate detailed angiographic reports beyond simple classification or segmentation. This study aims to fine-tune and evaluate the performance of a Vision-Language Model (VLM) in coronary angiogram interpretation and report generation. Using twenty thousand angiogram keyframes from 1987 patients collated across four unique datasets, we fine-tuned the InternVL2-4B model with Low-Rank Adaptation weights to perform stenosis detection, anatomy labelling, and report generation. The fine-tuned VLM achieved a precision of 0.56, recall of 0.64, and F1-score of 0.60 for stenosis detection. In anatomy segmentation, it attained a weighted precision of 0.50, recall of 0.43, and F1-score of 0.46, with higher scores in major vessel segments. Report generation integrating multiple angiographic projection views yielded an accuracy of 0.42, negative predictive value of 0.58, and specificity of 0.52. This study demonstrates the potential of using a VLM to streamline angiogram interpretation, rapidly providing actionable information to guide management, support care in resource-limited settings, and audit the appropriateness of coronary interventions. Author summary: Coronary artery disease carries a heavy disease burden worldwide, and coronary angiography is the gold-standard imaging for its diagnosis. Interpreting these complex images and producing clinical reports require significant expertise and time. In this study, we fine-tuned and investigated an open-source VLM, InternVL2-4B, to interpret and report coronary angiogram images in key tasks including stenosis detection, anatomy identification, and full report generation. 
We also referenced the fine-tuned InternVL2-4B against a state-of-the-art segmentation model, YOLOv8x, evaluated on the same test sets. We examined how machine learning metrics such as the intersection over union score may not fully capture the clinical accuracy of model predictions and discussed the limitations of relying solely on these metrics for evaluating clinical AI systems. Although the model has not yet achieved expert-level interpretation, our results demonstrate the potential and feasibility of automating the reporting of coronary angiograms. Such systems could potentially assist cardiologists by improving reporting efficiency, highlighting lesions that may require review, and enabling automated calculation of clinical scores such as the SYNTAX score.
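The stenosis-detection scores reported above (precision 0.56, recall 0.64, F1 0.60) are related by the standard F1 definition, the harmonic mean of precision and recall. A quick consistency check:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the standard F1-score)."""
    return 2 * precision * recall / (precision + recall)

# Reported stenosis-detection precision and recall
print(round(f1(0.56, 0.64), 2))  # consistent with the reported F1 of 0.60
```

The same identity applies to the anatomy-segmentation figures (weighted precision 0.50, recall 0.43, F1 0.46), keeping in mind that weighted averaging across vessel segments means the aggregate F1 need not equal the harmonic mean of the aggregate precision and recall exactly.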
Henson, J. C.; Spears, G. L.; Daughdrill, B. K.; Hagood, J. N.; Vallurupalli, S.
Background: Cardiac rehabilitation (CR) is a cost-effective, evidence-based intervention that improves outcomes for patients with heart failure (HF), yet access remains inequitable, particularly among Medicaid enrollees. This study evaluates the state-by-state variability in Medicaid coverage for CR services and examines the implications for health equity in vulnerable populations. Methods: We conducted a cross-sectional policy analysis of all 50 U.S. states to assess Medicaid coverage for outpatient CR services billed under CPT codes 93797 (without ECG monitoring) and 93798 (with ECG monitoring). Publicly available Medicaid documents were reviewed and supplemented with direct communication with state Medicaid agencies. States were categorized into full, partial/inconclusive, or no coverage. Geographic trends were visualized through heat maps and contextualized using state-level Medicaid enrollment data. Results: Marked disparities in CR coverage were identified. Only 41 states reimbursed for CPT 93797, and 43 for CPT 93798. Eight states lacked coverage for either code, predominantly in the South and Mountain West, including Arkansas, Georgia, Louisiana, Mississippi, Nevada, and Utah. States with the highest Medicaid enrollment (e.g., Louisiana, Arkansas) often provided no CR coverage, compounding access barriers for high-risk, low-income populations. Conclusions: The absence of standardized Medicaid coverage for CR contributes to systemic inequities in cardiovascular care, disproportionately impacting disadvantaged communities. Aligning Medicaid policies to ensure universal CR access--particularly through tele-rehabilitation and value-based care models--could reduce hospitalizations, improve survival, and promote health equity across the U.S.
Phillips, V.; Woodwal, P.
Background: Artificial intelligence and machine learning (AI/ML) are among the fastest-growing domains in NIH research funding, but whether children have shared equitably in this expansion is unknown. We characterized pediatric representation in NIH AI/ML funding from fiscal years (FY) 2020 to 2024. Methods: NIH grant data were obtained from Research Portfolio Online Reporting Tools Expenditures and Results bulk files for FY2020 to FY2024. AI/ML grants were identified using the NIH Research, Condition, and Disease Categorization "Machine Learning and Artificial Intelligence" category, and pediatric grants using the "Pediatric" category. Subprojects were excluded. Grants were deduplicated within each fiscal year by core project number for trend analyses, and across all years, retaining the most recent fiscal year, for cross-sectional totals. Disease areas were identified by keyword searches of titles and abstracts. Results: Across FY2020 to FY2024, 5,624 unique NIH AI/ML grants totaling $3,371 million were identified. Of these, 836 grants (14.9%) were classified as pediatric, representing $401 million (11.9%) of total NIH AI/ML funding. Although this share was consistent with the historically reported overall NIH pediatric funding baseline of approximately 10% to 12%, it remained substantially below the US pediatric population share of approximately 22%. The pediatric share of NIH AI/ML funding declined from 12.3% in FY2020 to 10.8% in FY2024, despite growth in absolute pediatric funding. Indexed to FY2020, pediatric AI/ML funding grew approximately 2.6-fold compared with 3.0-fold growth in the total portfolio. Across disease areas, unadjusted adult/general-to-pediatric funding ratios ranged from 2.0-fold in mental health to 9.8-fold in cancer. Conclusions: Pediatric representation in NIH AI/ML funding remained low and declined over time as the overall portfolio expanded. 
These findings suggest that growth in NIH AI/ML investment has not been matched by proportional gains for pediatric research.
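The deduplication step described above (one record per core project number, keeping the most recent fiscal year for cross-sectional totals) can be sketched in a few lines of Python. The record fields and project numbers below are illustrative, not drawn from the RePORTER files:

```python
def dedupe_latest(grants):
    """Keep one record per core project number, retaining the
    most recent fiscal year, as in the cross-sectional totals."""
    latest = {}
    for g in grants:
        key = g["core_project"]
        if key not in latest or g["fy"] > latest[key]["fy"]:
            latest[key] = g
    return list(latest.values())

# Hypothetical grant records; amounts in $ millions
grants = [
    {"core_project": "R01XX000001", "fy": 2020, "pediatric": False, "amount": 1.0},
    {"core_project": "R01XX000001", "fy": 2023, "pediatric": False, "amount": 1.2},
    {"core_project": "R01XX000002", "fy": 2022, "pediatric": True,  "amount": 0.5},
]
unique = dedupe_latest(grants)
ped_share = sum(g["amount"] for g in unique if g["pediatric"]) / \
            sum(g["amount"] for g in unique)
print(len(unique), round(100 * ped_share, 1))
```

For the trend analyses, the same grouping would instead be applied within each fiscal year separately, so a multi-year award is counted once per year rather than once overall.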